This is a small sample of the data (both in terms of observations and available variables) that is actually used for training our models in *****. All IDs are replaced, and slight noise is added to every datapoint to avoid any chance of identification. The data is stored as a small SQLite database in the attached session_11.db file, with two tables: dataset and metadata. The sample spans 1 year and 30k observations.
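For orientation, the tables a SQLite database exposes can be listed with a standard sqlite_master query. The sketch below demonstrates the query on a tiny in-memory database (not the real session_11.db) so it runs anywhere; with the real file you would pass './session_11.db' to sqlite3.connect instead.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory stand-in with the same two table names
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE dataset (case_id INTEGER, ct090 REAL)')
con.execute('CREATE TABLE metadata (varcode TEXT, name TEXT)')

# List all tables in the database
tables = pd.read_sql_query(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;", con)
print(tables['name'].tolist())
```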
import sqlite3
import sweetviz
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
con = sqlite3.connect('./session_11.db')
df_metadata = pd.read_sql_query('SELECT * FROM metadata;', con)
df_metadata
| | varcode | name | var_class | entity_table |
|---|---|---|---|---|
| 0 | a8404 | RatioOfCustomersAtAddressWithSuccessfullyClosedCasesLast36Months | variable | address |
| 1 | ap090 | TargetAmount90Days | target | case |
| 2 | c0001 | OriginalCapitalOfCaseInvoices | variable | case |
| 3 | c0015 | AmountOfCase | variable | case |
| 4 | c0019 | AgeOfDebt | variable | case |
| 5 | c0031 | NumberOfTelephonesCI | variable | case |
| 6 | c0039 | IndustryCode | variable | case |
| 7 | c0044 | ClientName | variable | case |
| 8 | c9008 | CustomerAge | variable | case |
| 9 | ct090 | Target90Days | target | case |
| 10 | b0007 | AmountOfCustomerPaymentsOnAllCasesBlevel | variable | debtor |
| 11 | d0009 | AmountOfCustomerOpenCases | variable | debtor_oc |
| 12 | d0012 | LastOriginalClosingCodeOfCustomer | variable | debtor_oc |
| 13 | d0027 | NumberOfCustomerIncomingCallDatesTee | variable | debtor_oc |
| 14 | d1205 | NumberOfSuccessfullyClosedCasesInLast24Months | variable | debtor_oc |
| 15 | d2112 | NumberOfCustomerPaymentsInLast12Months | variable | debtor_oc |
| 16 | d0031 | NumberOfUnsuccessfullyClosedCustomerCasesLast36Months | variable | debtor_oc |
ct090 — Target
case_id — Unique identifier
keydate — the point in time when some event happened in the lifecycle of a case, and the date relative to which all backward-looking variables and forward-looking targets are calculated. In this case it is a general-purpose propensity-to-pay model, meaning a freshly registered case where all the relevant data has been gathered and verified. In other words, keydate is set a few days after registration, and target ct090 checks for the outcome within 90 days (ap090 is a similar regression target), while all the rest of the data looks only backwards!
Metadata gives a basic description of the variables. The general naming convention is based on prefixes that define aggregation levels: cXXXX looks at the data of this case only, dXXXX at other cases of the same debtor, bXXXX at all cases of the debtor, and aXXXX at all cases at the same address. This is not very relevant for this particular task, but gives some idea of our data setup here in ******! Note that this data selection has quite a few variables with the dXXXX prefix, which means it is specifically looking at debtors we have already worked with before; therefore the variable selection is much broader and the models are generally better.
One more tip on the interpretation of missing values: if a variable is bound by a time window, e.g. d2112 NumberOfCustomerPaymentsInLast12Months, an NA value implies that there have never been any values, while 0 would mean that there have been no values within the bounding period (in this case 12 months). In other words, 0 and NA have different interpretations. This may or may not be relevant, depending on the choice of modelling approach.
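The NA-vs-0 distinction can be preserved through later imputation by recording an indicator flag first. A minimal sketch on a made-up toy series (not the real d2112 column):

```python
import numpy as np
import pandas as pd

# Toy version of a time-windowed count like d2112:
# NaN = debtor never observed before, 0 = observed but no payments in the window
payments_12m = pd.Series([np.nan, 0.0, 3.0, np.nan, 1.0])

# Keep the "never seen before" signal as an explicit flag before imputing
never_seen = payments_12m.isna().astype(int)
filled = payments_12m.fillna(0)

print(never_seen.tolist())  # [1, 0, 0, 1, 0]
print(filled.tolist())      # [0.0, 0.0, 3.0, 0.0, 1.0]
```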
print("Type of data:", type(df_metadata))
print("\nDimensions: \nNumber of rows:",df_metadata.shape[0], "\nNumber of columns:",df_metadata.shape[1])
df_metadata
Type of data: <class 'pandas.core.frame.DataFrame'>

Dimensions:
Number of rows: 17
Number of columns: 4
| | varcode | name | var_class | entity_table |
|---|---|---|---|---|
| 0 | a8404 | RatioOfCustomersAtAddressWithSuccessfullyClosedCasesLast36Months | variable | address |
| 1 | ap090 | TargetAmount90Days | target | case |
| 2 | c0001 | OriginalCapitalOfCaseInvoices | variable | case |
| 3 | c0015 | AmountOfCase | variable | case |
| 4 | c0019 | AgeOfDebt | variable | case |
| 5 | c0031 | NumberOfTelephonesCI | variable | case |
| 6 | c0039 | IndustryCode | variable | case |
| 7 | c0044 | ClientName | variable | case |
| 8 | c9008 | CustomerAge | variable | case |
| 9 | ct090 | Target90Days | target | case |
| 10 | b0007 | AmountOfCustomerPaymentsOnAllCasesBlevel | variable | debtor |
| 11 | d0009 | AmountOfCustomerOpenCases | variable | debtor_oc |
| 12 | d0012 | LastOriginalClosingCodeOfCustomer | variable | debtor_oc |
| 13 | d0027 | NumberOfCustomerIncomingCallDatesTee | variable | debtor_oc |
| 14 | d1205 | NumberOfSuccessfullyClosedCasesInLast24Months | variable | debtor_oc |
| 15 | d2112 | NumberOfCustomerPaymentsInLast12Months | variable | debtor_oc |
| 16 | d0031 | NumberOfUnsuccessfullyClosedCustomerCasesLast36Months | variable | debtor_oc |
Looking at this dataset, I concluded that I need the names to replace the codes in the main dataset. That is why I will modify the names and make them shorter.
# function to convert CamelCase names to snake_case
import re

def camel_to_snake(name):
    # insert an underscore at each word boundary, then lowercase
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()
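A quick sanity check of the conversion, repeating the function here so the snippet is self-contained (the example names come from the metadata above):

```python
import re

def camel_to_snake(name):
    # insert an underscore at each word boundary, then lowercase
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1).lower()

print(camel_to_snake('AmountOfCase'))          # amount_of_case
print(camel_to_snake('NumberOfTelephonesCI'))  # number_of_telephones_ci
```

Note that trailing acronyms like "CI" survive as one token, which is why the second example ends in "_ci" rather than "_c_i".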
# rename code names using the camel_to_snake function
df_metadata['name'] = df_metadata['name'].apply(camel_to_snake)
# print the modified DataFrame
pd.set_option('display.max_colwidth', None)
df_metadata.name
0     ratio_of_customers_at_address_with_successfully_closed_cases_last36_months
1     target_amount90_days
2     original_capital_of_case_invoices
3     amount_of_case
4     age_of_debt
5     number_of_telephones_ci
6     industry_code
7     client_name
8     customer_age
9     target90_days
10    amount_of_customer_payments_on_all_cases_blevel
11    amount_of_customer_open_cases
12    last_original_closing_code_of_customer
13    number_of_customer_incoming_call_dates_tee
14    number_of_successfully_closed_cases_in_last24_months
15    number_of_customer_payments_in_last12_months
16    number_of_unsuccessfully_closed_customer_cases_last36_months
Name: name, dtype: object
df_metadata.replace({'name':
    {
        "ratio_of_customers_at_address_with_successfully_closed_cases_last36_months": "customers_at_address/success_closed_cases_36M",
        "original_capital_of_case_invoices": "original_capital",
        "number_of_telephones_ci": "no.telephones",
        "amount_of_customer_payments_on_all_cases_blevel": "cust_payments_all_cases",
        "last_original_closing_code_of_customer": "last_original_closing_code",
        "number_of_customer_incoming_call_dates_tee": "cust_incoming_call_dates",
        "number_of_successfully_closed_cases_in_last24_months": "success_closed_cases_24M",
        "number_of_customer_payments_in_last12_months": "cust_payments_12M",
        "number_of_unsuccessfully_closed_customer_cases_last36_months": "failed_closed_cust_cases_36M"
    }}, inplace=True)
df_metadata
| | varcode | name | var_class | entity_table |
|---|---|---|---|---|
| 0 | a8404 | customers_at_address/success_closed_cases_36M | variable | address |
| 1 | ap090 | target_amount90_days | target | case |
| 2 | c0001 | original_capital | variable | case |
| 3 | c0015 | amount_of_case | variable | case |
| 4 | c0019 | age_of_debt | variable | case |
| 5 | c0031 | no.telephones | variable | case |
| 6 | c0039 | industry_code | variable | case |
| 7 | c0044 | client_name | variable | case |
| 8 | c9008 | customer_age | variable | case |
| 9 | ct090 | target90_days | target | case |
| 10 | b0007 | cust_payments_all_cases | variable | debtor |
| 11 | d0009 | amount_of_customer_open_cases | variable | debtor_oc |
| 12 | d0012 | last_original_closing_code | variable | debtor_oc |
| 13 | d0027 | cust_incoming_call_dates | variable | debtor_oc |
| 14 | d1205 | success_closed_cases_24M | variable | debtor_oc |
| 15 | d2112 | cust_payments_12M | variable | debtor_oc |
| 16 | d0031 | failed_closed_cust_cases_36M | variable | debtor_oc |
df_dataset = pd.read_sql_query('SELECT * FROM dataset;', con)
print("Type of data:", type(df_dataset))
print("\nDimensions: \nNumber of rows:",df_dataset.shape[0], "\nNumber of columns:",df_dataset.shape[1])
df_dataset
Type of data: <class 'pandas.core.frame.DataFrame'>

Dimensions:
Number of rows: 30000
Number of columns: 19
| | case_id | keydate | ct090 | ap090 | c0001 | c0039 | c0044 | d0031 | b0007 | d0009 | c0031 | a8404 | c0019 | d0027 | c9008 | d2112 | d0012 | d1205 | c0015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-08-12 00:00:00.0 | 0.0 | 0.0 | 221.68 | K6622 | 1 | 2.0 | 0.00 | 238.38 | 2.0 | NaN | 98.0 | 0.0 | 49.0 | 0.0 | 1 | NaN | 222.69 |
| 1 | 2 | 2017-02-03 00:00:00.0 | 0.0 | 0.0 | 151.36 | K6512 | 2 | NaN | 210.53 | 0.00 | 5.0 | NaN | 109.0 | 2.0 | 51.0 | 0.0 | 2 | 1.0 | 212.72 |
| 2 | 3 | 2017-02-17 00:00:00.0 | 0.0 | 0.0 | 48.84 | K6512 | 3 | 1.0 | NaN | 0.00 | 2.0 | 0.00 | 748.0 | 0.0 | 48.0 | NaN | 3 | NaN | 56.84 |
| 3 | 4 | 2017-09-18 00:00:00.0 | 0.0 | 0.0 | 413.15 | K6622 | 4 | NaN | NaN | 54.14 | 3.0 | 1.00 | 8.0 | 0.0 | 27.0 | NaN | 4 | NaN | 463.15 |
| 4 | 5 | 2017-07-22 00:00:00.0 | 0.0 | 0.0 | 125.83 | K6512 | 5 | 2.0 | 20.00 | 83.00 | 6.0 | NaN | 324.0 | 0.0 | 40.0 | 0.0 | 1 | NaN | 146.09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | 29996 | 2017-05-22 00:00:00.0 | 0.0 | 0.0 | 435.46 | K6512 | 27 | NaN | 978.62 | 242.06 | 3.0 | 1.00 | 4.0 | 0.0 | 40.0 | 3.0 | 7 | 1.0 | 435.46 |
| 29996 | 29997 | 2017-08-20 00:00:00.0 | 1.0 | 188.4 | 344.07 | K6512 | 10 | NaN | NaN | 25408.75 | 2.0 | 0.27 | 111.0 | 1.0 | 40.0 | NaN | 4 | NaN | 372.48 |
| 29997 | 29998 | 2017-06-11 00:00:00.0 | 0.0 | 0.0 | 417.23 | K6512 | 6 | 1.0 | NaN | 0.00 | 3.0 | 0.09 | 103.0 | 0.0 | 41.0 | NaN | 1 | NaN | 516.45 |
| 29998 | 29999 | 2017-02-17 00:00:00.0 | 0.0 | 0.0 | 529.00 | K6512 | 11 | NaN | 101.90 | 0.00 | 2.0 | NaN | 199.0 | 0.0 | 54.0 | 0.0 | 6 | 1.0 | 544.00 |
| 29999 | 30000 | 2017-08-10 00:00:00.0 | 0.0 | 0.0 | 174.23 | K6419 | 36 | NaN | 56.62 | 0.00 | 1.0 | 0.00 | 7.0 | 0.0 | 33.0 | 1.0 | 7 | 1.0 | 174.65 |
30000 rows × 19 columns
#Basic Information about dataset
df_dataset.info()
df_dataset.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 19 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   case_id  30000 non-null  int64
 1   keydate  30000 non-null  object
 2   ct090    30000 non-null  float64
 3   ap090    30000 non-null  float64
 4   c0001    29975 non-null  float64
 5   c0039    30000 non-null  object
 6   c0044    30000 non-null  object
 7   d0031    17371 non-null  float64
 8   b0007    13074 non-null  float64
 9   d0009    30000 non-null  float64
 10  c0031    30000 non-null  float64
 11  a8404    19599 non-null  float64
 12  c0019    30000 non-null  float64
 13  d0027    30000 non-null  float64
 14  c9008    27016 non-null  float64
 15  d2112    13048 non-null  float64
 16  d0012    30000 non-null  object
 17  d1205    8762 non-null   float64
 18  c0015    30000 non-null  float64
dtypes: float64(14), int64(1), object(4)
memory usage: 4.3+ MB
| | case_id | ct090 | ap090 | c0001 | d0031 | b0007 | d0009 | c0031 | a8404 | c0019 | d0027 | c9008 | d2112 | d1205 | c0015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 30000.000000 | 30000.000000 | 30000.000000 | 29975.000000 | 17371.000000 | 13074.000000 | 30000.000000 | 30000.000000 | 19599.000000 | 30000.000000 | 30000.000000 | 27016.000000 | 13048.000000 | 8762.000000 | 30000.000000 |
| mean | 15000.500000 | 0.168733 | 49.215836 | 538.590694 | 1.634391 | 423.156043 | 555.063823 | 3.092900 | 0.283395 | 104.948833 | 0.473067 | 42.339466 | 0.935622 | 0.897170 | 605.535361 |
| std | 8660.398374 | 0.374522 | 240.063401 | 1248.533877 | 1.318419 | 902.166491 | 1939.588574 | 2.151021 | 0.312366 | 196.864753 | 1.397094 | 13.215883 | 1.755761 | 0.814515 | 1223.783876 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -319.010000 | 0.000000 | 0.000000 | 3.000000 | 0.000000 | 6.000000 | 0.000000 | 0.000000 | 9.750000 |
| 25% | 7500.750000 | 0.000000 | 0.000000 | 145.100000 | 1.000000 | 76.000000 | 0.000000 | 2.000000 | 0.000000 | 10.000000 | 0.000000 | 32.000000 | 0.000000 | 0.000000 | 185.410000 |
| 50% | 15000.500000 | 0.000000 | 0.000000 | 298.720000 | 1.000000 | 202.715000 | 0.000000 | 3.000000 | 0.210000 | 77.000000 | 0.000000 | 41.000000 | 0.000000 | 1.000000 | 355.650000 |
| 75% | 22500.250000 | 0.000000 | 0.000000 | 638.645000 | 2.000000 | 500.677500 | 481.322500 | 4.000000 | 0.450000 | 126.000000 | 0.000000 | 51.000000 | 1.000000 | 1.000000 | 725.480000 |
| max | 30000.000000 | 1.000000 | 25000.000000 | 84561.840000 | 15.000000 | 53982.610000 | 110158.640000 | 35.000000 | 1.000000 | 6193.000000 | 40.000000 | 117.000000 | 24.000000 | 15.000000 | 84561.840000 |
#Check for non-numerical columns
df_dataset.select_dtypes(exclude=np.number)
| | keydate | c0039 | c0044 | d0012 |
|---|---|---|---|---|
| 0 | 2017-08-12 00:00:00.0 | K6622 | 1 | 1 |
| 1 | 2017-02-03 00:00:00.0 | K6512 | 2 | 2 |
| 2 | 2017-02-17 00:00:00.0 | K6512 | 3 | 3 |
| 3 | 2017-09-18 00:00:00.0 | K6622 | 4 | 4 |
| 4 | 2017-07-22 00:00:00.0 | K6512 | 5 | 1 |
| ... | ... | ... | ... | ... |
| 29995 | 2017-05-22 00:00:00.0 | K6512 | 27 | 7 |
| 29996 | 2017-08-20 00:00:00.0 | K6512 | 10 | 4 |
| 29997 | 2017-06-11 00:00:00.0 | K6512 | 6 | 1 |
| 29998 | 2017-02-17 00:00:00.0 | K6512 | 11 | 6 |
| 29999 | 2017-08-10 00:00:00.0 | K6419 | 36 | 7 |
30000 rows × 4 columns
df_metadata['name']
0     customers_at_address/success_closed_cases_36M
1     target_amount90_days
2     original_capital
3     amount_of_case
4     age_of_debt
5     no.telephones
6     industry_code
7     client_name
8     customer_age
9     target90_days
10    cust_payments_all_cases
11    amount_of_customer_open_cases
12    last_original_closing_code
13    cust_incoming_call_dates
14    success_closed_cases_24M
15    cust_payments_12M
16    failed_closed_cust_cases_36M
Name: name, dtype: object
#Rename columns in df_dataset:
# create a dictionary to map column names from df_dataset to df_metadata
column_map = dict(zip(df_metadata['varcode'], df_metadata['name']))
# rename columns in df_dataset using the column_map dictionary
df_dataset = df_dataset.rename(columns=column_map)
pd.set_option('display.max_colwidth', 0)
df_dataset.set_index("case_id")  # note: result not assigned back, so df_dataset keeps its default index
df_dataset
| | case_id | keydate | target90_days | target_amount90_days | original_capital | industry_code | client_name | failed_closed_cust_cases_36M | cust_payments_all_cases | amount_of_customer_open_cases | no.telephones | customers_at_address/success_closed_cases_36M | age_of_debt | cust_incoming_call_dates | customer_age | cust_payments_12M | last_original_closing_code | success_closed_cases_24M | amount_of_case |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2017-08-12 00:00:00.0 | 0.0 | 0.0 | 221.68 | K6622 | 1 | 2.0 | 0.00 | 238.38 | 2.0 | NaN | 98.0 | 0.0 | 49.0 | 0.0 | 1 | NaN | 222.69 |
| 1 | 2 | 2017-02-03 00:00:00.0 | 0.0 | 0.0 | 151.36 | K6512 | 2 | NaN | 210.53 | 0.00 | 5.0 | NaN | 109.0 | 2.0 | 51.0 | 0.0 | 2 | 1.0 | 212.72 |
| 2 | 3 | 2017-02-17 00:00:00.0 | 0.0 | 0.0 | 48.84 | K6512 | 3 | 1.0 | NaN | 0.00 | 2.0 | 0.00 | 748.0 | 0.0 | 48.0 | NaN | 3 | NaN | 56.84 |
| 3 | 4 | 2017-09-18 00:00:00.0 | 0.0 | 0.0 | 413.15 | K6622 | 4 | NaN | NaN | 54.14 | 3.0 | 1.00 | 8.0 | 0.0 | 27.0 | NaN | 4 | NaN | 463.15 |
| 4 | 5 | 2017-07-22 00:00:00.0 | 0.0 | 0.0 | 125.83 | K6512 | 5 | 2.0 | 20.00 | 83.00 | 6.0 | NaN | 324.0 | 0.0 | 40.0 | 0.0 | 1 | NaN | 146.09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | 29996 | 2017-05-22 00:00:00.0 | 0.0 | 0.0 | 435.46 | K6512 | 27 | NaN | 978.62 | 242.06 | 3.0 | 1.00 | 4.0 | 0.0 | 40.0 | 3.0 | 7 | 1.0 | 435.46 |
| 29996 | 29997 | 2017-08-20 00:00:00.0 | 1.0 | 188.4 | 344.07 | K6512 | 10 | NaN | NaN | 25408.75 | 2.0 | 0.27 | 111.0 | 1.0 | 40.0 | NaN | 4 | NaN | 372.48 |
| 29997 | 29998 | 2017-06-11 00:00:00.0 | 0.0 | 0.0 | 417.23 | K6512 | 6 | 1.0 | NaN | 0.00 | 3.0 | 0.09 | 103.0 | 0.0 | 41.0 | NaN | 1 | NaN | 516.45 |
| 29998 | 29999 | 2017-02-17 00:00:00.0 | 0.0 | 0.0 | 529.00 | K6512 | 11 | NaN | 101.90 | 0.00 | 2.0 | NaN | 199.0 | 0.0 | 54.0 | 0.0 | 6 | 1.0 | 544.00 |
| 29999 | 30000 | 2017-08-10 00:00:00.0 | 0.0 | 0.0 | 174.23 | K6419 | 36 | NaN | 56.62 | 0.00 | 1.0 | 0.00 | 7.0 | 0.0 | 33.0 | 1.0 | 7 | 1.0 | 174.65 |
30000 rows × 19 columns
#we do not need case_id because we have an index already
df_dataset.drop('case_id', axis=1, inplace=True)
df_dataset
| | keydate | target90_days | target_amount90_days | original_capital | industry_code | client_name | failed_closed_cust_cases_36M | cust_payments_all_cases | amount_of_customer_open_cases | no.telephones | customers_at_address/success_closed_cases_36M | age_of_debt | cust_incoming_call_dates | customer_age | cust_payments_12M | last_original_closing_code | success_closed_cases_24M | amount_of_case |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-12 00:00:00.0 | 0.0 | 0.0 | 221.68 | K6622 | 1 | 2.0 | 0.00 | 238.38 | 2.0 | NaN | 98.0 | 0.0 | 49.0 | 0.0 | 1 | NaN | 222.69 |
| 1 | 2017-02-03 00:00:00.0 | 0.0 | 0.0 | 151.36 | K6512 | 2 | NaN | 210.53 | 0.00 | 5.0 | NaN | 109.0 | 2.0 | 51.0 | 0.0 | 2 | 1.0 | 212.72 |
| 2 | 2017-02-17 00:00:00.0 | 0.0 | 0.0 | 48.84 | K6512 | 3 | 1.0 | NaN | 0.00 | 2.0 | 0.00 | 748.0 | 0.0 | 48.0 | NaN | 3 | NaN | 56.84 |
| 3 | 2017-09-18 00:00:00.0 | 0.0 | 0.0 | 413.15 | K6622 | 4 | NaN | NaN | 54.14 | 3.0 | 1.00 | 8.0 | 0.0 | 27.0 | NaN | 4 | NaN | 463.15 |
| 4 | 2017-07-22 00:00:00.0 | 0.0 | 0.0 | 125.83 | K6512 | 5 | 2.0 | 20.00 | 83.00 | 6.0 | NaN | 324.0 | 0.0 | 40.0 | 0.0 | 1 | NaN | 146.09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29995 | 2017-05-22 00:00:00.0 | 0.0 | 0.0 | 435.46 | K6512 | 27 | NaN | 978.62 | 242.06 | 3.0 | 1.00 | 4.0 | 0.0 | 40.0 | 3.0 | 7 | 1.0 | 435.46 |
| 29996 | 2017-08-20 00:00:00.0 | 1.0 | 188.4 | 344.07 | K6512 | 10 | NaN | NaN | 25408.75 | 2.0 | 0.27 | 111.0 | 1.0 | 40.0 | NaN | 4 | NaN | 372.48 |
| 29997 | 2017-06-11 00:00:00.0 | 0.0 | 0.0 | 417.23 | K6512 | 6 | 1.0 | NaN | 0.00 | 3.0 | 0.09 | 103.0 | 0.0 | 41.0 | NaN | 1 | NaN | 516.45 |
| 29998 | 2017-02-17 00:00:00.0 | 0.0 | 0.0 | 529.00 | K6512 | 11 | NaN | 101.90 | 0.00 | 2.0 | NaN | 199.0 | 0.0 | 54.0 | 0.0 | 6 | 1.0 | 544.00 |
| 29999 | 2017-08-10 00:00:00.0 | 0.0 | 0.0 | 174.23 | K6419 | 36 | NaN | 56.62 | 0.00 | 1.0 | 0.00 | 7.0 | 0.0 | 33.0 | 1.0 | 7 | 1.0 | 174.65 |
30000 rows × 18 columns
In this part, I decided to simultaneously explore, visualize, and correct variables and common problems associated with the data. I will concentrate on distributions and provide more context before deciding on transformation, normalization, and scaling. Instead of creating multiple lines of code for each graph, I will use a valuable tool, Visual Analytics, to provide the reader with multiple interactive graphs in one window. Then I will move on to investigating the distribution of each variable and applying necessary changes, if needed.
When it comes to particular cleaning processes, I will look at (if necessary):
from pandas_visual_analysis import VisualAnalysis
VisualAnalysis(df_dataset)
Because the Visual Analysis widget is not visible in HTML mode, I include a screenshot.
#Check for duplicates
print("Number of Duplicates", df_dataset.duplicated().sum())
Number of Duplicates 0
Comment
Fortunately, we do not have duplicates in our dataset.
From the previous interactive graph we see that the distributions of many of our variables are rather (highly) skewed.
In cases where payment was not made within 90 days, the initial capital has a wider range, with far more occurrences of initial capital greater than 20k, which is almost never the case for those paid within 90 days.
Also, when payment was not made, there was a larger amount of customer open cases, and in those cases we can detect a higher number of telephones.
Crucially, for failed payments (target90_days = 0), the age of debt ranges much higher, frequently above 2000 and up to 6000.
Interestingly enough, clients who did not pay within 90 days have far more outliers when it comes to age (e.g. age = 6, which is impossible, and age = 117, the maximum value for that variable, which is highly improbable).
However, the most interesting and important conclusion is that our dataset is heavily imbalanced when it comes to the target variable.
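The imbalance can be quantified directly with value_counts. An illustrative sketch on a synthetic target with roughly our class share (in the notebook the real column would be df_dataset['target90_days'], with about 17% positives):

```python
import pandas as pd

# Synthetic binary target: 83 negatives, 17 positives
target = pd.Series([0.0] * 83 + [1.0] * 17)

# Relative frequency of each class
class_share = target.value_counts(normalize=True)
print(class_share.to_dict())  # {0.0: 0.83, 1.0: 0.17}
```

A share this far from 50/50 is worth keeping in mind for every later modelling and cleaning decision.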
In this part, I am interested in investigating each variable on its own, especially in the context of outliers, distribution, and frequencies.
df_dataset1=df_dataset.copy()
# Univariate Analysis for Numerical Columns
numeric_cols = df_dataset.select_dtypes(include=np.number)
# plot histogram for each numeric column
for col in numeric_cols.columns.tolist():
    sns.histplot(df_dataset[col])
    plt.title(col)
    plt.show()

# plot boxplot for each numeric column
for col in numeric_cols.columns.tolist():
    sns.boxplot(df_dataset[col])
    plt.title(col)
    plt.show()
Comment
Based on the analysis of the graphs above, we can conclude that the target variable exhibits imbalanced characteristics. Moreover, when it comes to the independent numerical variables, most of them are right-skewed. This may be due to:
1. Outliers
2. Floor or Ceiling Effects: In some data sets, there may be lower or upper limits that restrict the range of values that can be measured. This can result in a clustering of data points at the limit.
3. Exponential Growth: When a variable experiences exponential growth, it can cause a right-skewed distribution. This is because the variable will start with low values but experience rapid growth over time, resulting in a few high values that skew the distribution to the right.
4. Sampling Bias: Sampling bias occurs when a data set is not representative of the population from which it was drawn. If the sample is skewed to the right, the resulting data set will also be right-skewed.
5. Limited Precision: Limited precision can occur when data is collected using instruments that have limited measurement accuracy. This can result in a clustering of data points at certain values, leading to a right-skewed distribution.
Moreover, we can clearly see the presence of many outliers (based on the boxplot analysis). The problem here is that we don't want to remove data points, because we have a heavily imbalanced dataset. Removing outliers right now would exacerbate the problem.
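One way to measure the outlier problem without removing any rows is to count IQR-rule outliers per column. A sketch on a tiny made-up frame (column names are illustrative, not the real dataset):

```python
import pandas as pd

# Tiny illustrative frame with one obvious outlier in 'amount'
df = pd.DataFrame({'amount': [100, 120, 130, 110, 125, 5000],
                   'age':    [30, 40, 35, 45, 38, 41]})

# 1.5 * IQR fences per column, computed without dropping anything
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

print(outlier_mask.sum().to_dict())  # {'amount': 1, 'age': 0}
```

The boolean mask can later feed a capping (winsorizing) step instead of row deletion, which keeps the class balance intact.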
# Univariate Analysis for Non-Numerical Columns
non_numeric_cols = df_dataset.select_dtypes(exclude=np.number)
print(non_numeric_cols.dtypes)
non_numeric_cols
keydate                       object
industry_code                 object
client_name                   object
last_original_closing_code    object
dtype: object
| | keydate | industry_code | client_name | last_original_closing_code |
|---|---|---|---|---|
| 0 | 2017-08-12 00:00:00.0 | K6622 | 1 | 1 |
| 1 | 2017-02-03 00:00:00.0 | K6512 | 2 | 2 |
| 2 | 2017-02-17 00:00:00.0 | K6512 | 3 | 3 |
| 3 | 2017-09-18 00:00:00.0 | K6622 | 4 | 4 |
| 4 | 2017-07-22 00:00:00.0 | K6512 | 5 | 1 |
| ... | ... | ... | ... | ... |
| 29995 | 2017-05-22 00:00:00.0 | K6512 | 27 | 7 |
| 29996 | 2017-08-20 00:00:00.0 | K6512 | 10 | 4 |
| 29997 | 2017-06-11 00:00:00.0 | K6512 | 6 | 1 |
| 29998 | 2017-02-17 00:00:00.0 | K6512 | 11 | 6 |
| 29999 | 2017-08-10 00:00:00.0 | K6419 | 36 | 7 |
30000 rows × 4 columns
#Convert keydate column to date format
df_dataset['keydate'] = pd.to_datetime(df_dataset['keydate'], format='%Y-%m-%d %H:%M:%S.%f')
# plot countplot for each non-numeric column (keydate is analyzed separately below)
for col in non_numeric_cols.columns.tolist():
    if col == "keydate":
        continue
    fig, ax = plt.subplots(figsize=(25, 7))
    sns.countplot(x=col, data=df_dataset, ax=ax)
    plt.title(col)
    plt.xticks(rotation=45)
    plt.show()
#Analysis of datetime (keydate column)
# Filter the DataFrame to include only rows where target is 1 and then when it's 0
filtered_1 = df_dataset[df_dataset['target90_days'] == 1]
filtered_0 = df_dataset[df_dataset['target90_days'] == 0]
# plot the distributions of year, month, and day of keydate,
# overall and separately for each target outcome
subsets = [('General', df_dataset),
           ('Payment Failed', filtered_0),
           ('Payment Successful', filtered_1)]
for part, attr, bins in [('Year', 'year', 10), ('Month', 'month', 12), ('Day', 'day', 31)]:
    for label, data in subsets:
        plt.hist(getattr(data['keydate'].dt, attr), bins=bins)
        plt.xlabel(part)
        plt.ylabel('Frequency')
        plt.title(f'Distribution of {part} — {label}')
        plt.show()
Comment — Key Conclusions:
Before further analysis, I decided to first check missing values, because they can heavily impact the distribution and other statistical computations.
#Missing values for Each Category
for col in df_dataset.columns:
    if df_dataset[col].isnull().values.any():
        print(col)
        missing_count = df_dataset[col].isnull().sum()
        print("Missing Values: ", missing_count, "({:.2%})\n".format(missing_count / df_dataset.shape[0]))
sns.heatmap(df_dataset.isnull(), yticklabels=False, cbar=False, cmap="viridis")
original_capital
Missing Values:  25 (0.08%)

failed_closed_cust_cases_36M
Missing Values:  12629 (42.10%)

cust_payments_all_cases
Missing Values:  16926 (56.42%)

customers_at_address/success_closed_cases_36M
Missing Values:  10401 (34.67%)

customer_age
Missing Values:  2984 (9.95%)

cust_payments_12M
Missing Values:  16952 (56.51%)

success_closed_cases_24M
Missing Values:  21238 (70.79%)
<AxesSubplot:>
Comment
The problem with missing values is severe. Except for the original_capital variable, many variables have missing data that constitutes a significant percentage of all cases in the dataset, which is quite alarming.
After pondering how this problem should be tackled, I decided that original_capital's missing values are a tiny percentage, so I will drop those rows.
As for the rest: missing values in those columns constitute far more than a few percent of the total dataset, so I decided to fill them in. I chose the mean value, in order not to artificially skew the distribution in one way or another.
df_dataset.dropna(subset=['original_capital'], inplace=True)
#Fill with mean
cols_to_fill = ['failed_closed_cust_cases_36M', 'cust_payments_all_cases',
                'customers_at_address/success_closed_cases_36M', 'customer_age',
                'cust_payments_12M', 'success_closed_cases_24M']
for col in cols_to_fill:
    df_dataset[col].fillna(value=df_dataset[col].mean(), inplace=True)
#Missing values for Each Category
for col in df_dataset.columns:
    if df_dataset[col].isnull().values.any():
        print(col)
        missing_count = df_dataset[col].isnull().sum()
        print("Missing Values: ", missing_count, "({:.2%})\n".format(missing_count / df_dataset.shape[0]))
sns.heatmap(df_dataset.isnull(), yticklabels=False, cbar=False, cmap="viridis")
<AxesSubplot:>
sns.pairplot(df_dataset)
plt.show()
Comment
Next, I want to further investigate the keydate column by creating the following table:
Create a new column with the year and month information from the keydate column.
from tabulate import tabulate
date_df = df_dataset1.assign(
    # year*100 + two-digit month, e.g. '2017-02-03' -> 201702
    # (str[5:7] takes both month digits; str[6:7] would break for months 10-12)
    key_date=lambda x: (pd.to_numeric(x['keydate'].str[:4]) * 100
                        + pd.to_numeric(x['keydate'].str[5:7])),
    last_original_closing_code=pd.to_numeric(df_dataset1['last_original_closing_code']),
    client_name=pd.to_numeric(df_dataset1['client_name']),
    industry_code=pd.Categorical(df_dataset1['industry_code'])
).drop(columns=['keydate'])
grouped_df = date_df.groupby('key_date').agg(
    Tot=('key_date', 'count'),
    Good=('target90_days', lambda x: sum(x == 1)),
    Bad=('target90_days', lambda x: sum(x == 0))
).assign(
    GoodRate=lambda x: x.Good / x.Tot,
    BadRate=lambda x: x.Bad / x.Tot
).reset_index()
table = tabulate(grouped_df, headers='keys', tablefmt='heavy_outline', showindex=False)
print(table)
┏━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃   key_date ┃   Tot ┃   Good ┃   Bad ┃   GoodRate ┃   BadRate ┃
┣━━━━━━━━━━━━╋━━━━━━━╋━━━━━━━━╋━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━┫
┃     201701 ┃  2149 ┃    441 ┃  1708 ┃   0.205212 ┃  0.794788 ┃
┃     201702 ┃  4059 ┃    566 ┃  3493 ┃   0.139443 ┃  0.860557 ┃
┃     201703 ┃  3247 ┃    537 ┃  2710 ┃   0.165383 ┃  0.834617 ┃
┃     201704 ┃  3027 ┃    448 ┃  2579 ┃   0.148001 ┃  0.851999 ┃
┃     201705 ┃  3931 ┃    632 ┃  3299 ┃   0.160773 ┃  0.839227 ┃
┃     201706 ┃  3485 ┃    630 ┃  2855 ┃   0.180775 ┃  0.819225 ┃
┃     201707 ┃  3725 ┃    667 ┃  3058 ┃   0.17906  ┃  0.82094  ┃
┃     201708 ┃  3201 ┃    564 ┃  2637 ┃   0.176195 ┃  0.823805 ┃
┃     201709 ┃  3176 ┃    577 ┃  2599 ┃   0.181675 ┃  0.818325 ┃
┗━━━━━━━━━━━━┻━━━━━━━┻━━━━━━━━┻━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━┛
In the correlation analysis, we look for pairs of variables whose correlation coefficient is higher than 0.8. In those cases it is unnecessary to keep both, because they provide similar information.
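A quick way to surface such pairs is to scan the upper triangle of the absolute correlation matrix. A sketch on synthetic data with one deliberately collinear pair (column names only echo our dataset; the real frame would be the numeric columns of df_dataset):

```python
import numpy as np
import pandas as pd

# Synthetic frame: 'amount_of_case' is almost a scaled copy of 'original_capital'
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({'original_capital': x,
                   'amount_of_case': x * 1.1 + rng.normal(scale=0.05, size=200),
                   'customer_age': rng.normal(size=200)})

# Keep only the strict upper triangle to avoid the diagonal and duplicate pairs
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(a, b)
         for a in upper.index for b in upper.columns
         if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > 0.8]
print(pairs)
```

On the real data this flags original_capital vs amount_of_case (correlation 0.955 in the matrix below), which matches their definitions.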
# Correlation Analysis
corr = df_dataset.select_dtypes(include=np.number).corr()
f, ax = plt.subplots(figsize=(22, 22))
sns.heatmap(corr, vmax=.8, square=True)
plt.title('Correlation Matrix')
plt.show()
df_dataset.corr()
| | target90_days | target_amount90_days | original_capital | failed_closed_cust_cases_36M | cust_payments_all_cases | amount_of_customer_open_cases | no.telephones | customers_at_address/success_closed_cases_36M | age_of_debt | cust_incoming_call_dates | customer_age | cust_payments_12M | success_closed_cases_24M | amount_of_case |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| target90_days | 1.000000 | 0.455022 | -0.029447 | -0.060773 | 0.015346 | -0.051445 | -0.085102 | 0.062236 | 0.012971 | 0.064594 | 0.053495 | 0.024587 | 0.033723 | -0.027913 |
| target_amount90_days | 0.455022 | 1.000000 | 0.149193 | -0.024964 | 0.038019 | -0.012134 | -0.042509 | 0.036847 | 0.005711 | 0.023990 | 0.030312 | 0.030774 | 0.020208 | 0.157999 |
| original_capital | -0.029447 | 0.149193 | 1.000000 | 0.006116 | 0.029755 | 0.059502 | 0.017770 | 0.003586 | 0.221894 | -0.000576 | -0.005002 | 0.003320 | 0.006351 | 0.955115 |
| failed_closed_cust_cases_36M | -0.060773 | -0.024964 | 0.006116 | 1.000000 | -0.006194 | 0.102677 | 0.347225 | -0.035838 | -0.015670 | 0.069074 | -0.008807 | -0.011641 | -0.006692 | 0.008785 |
| cust_payments_all_cases | 0.015346 | 0.038019 | 0.029755 | -0.006194 | 1.000000 | 0.056416 | 0.043866 | 0.027977 | 0.026166 | 0.148860 | 0.041711 | 0.190925 | 0.149066 | 0.032384 |
| amount_of_customer_open_cases | -0.051445 | -0.012134 | 0.059502 | 0.102677 | 0.056416 | 1.000000 | 0.170067 | -0.007623 | -0.012296 | 0.052853 | 0.033223 | 0.074043 | -0.010468 | 0.063697 |
| no.telephones | -0.085102 | -0.042509 | 0.017770 | 0.347225 | 0.043866 | 0.170067 | 1.000000 | -0.048982 | -0.035476 | 0.156168 | 0.037316 | 0.013816 | -0.016682 | 0.025305 |
| customers_at_address/success_closed_cases_36M | 0.062236 | 0.036847 | 0.003586 | -0.035838 | 0.027977 | -0.007623 | -0.048982 | 1.000000 | 0.018288 | 0.016542 | 0.035798 | -0.000773 | 0.021125 | 0.000484 |
| age_of_debt | 0.012971 | 0.005711 | 0.221894 | -0.015670 | 0.026166 | -0.012296 | -0.035476 | 0.018288 | 1.000000 | -0.006299 | 0.000121 | -0.001202 | 0.008714 | 0.210353 |
| cust_incoming_call_dates | 0.064594 | 0.023990 | -0.000576 | 0.069074 | 0.148860 | 0.052853 | 0.156168 | 0.016542 | -0.006299 | 1.000000 | 0.040227 | 0.226333 | 0.037336 | 0.001225 |
| customer_age | 0.053495 | 0.030312 | -0.005002 | -0.008807 | 0.041711 | 0.033223 | 0.037316 | 0.035798 | 0.000121 | 0.040227 | 1.000000 | -0.005527 | -0.003331 | -0.002531 |
| cust_payments_12M | 0.024587 | 0.030774 | 0.003320 | -0.011641 | 0.190925 | 0.074043 | 0.013816 | -0.000773 | -0.001202 | 0.226333 | -0.005527 | 1.000000 | 0.238496 | 0.004585 |
| success_closed_cases_24M | 0.033723 | 0.020208 | 0.006351 | -0.006692 | 0.149066 | -0.010468 | -0.016682 | 0.021125 | 0.008714 | 0.037336 | -0.003331 | 0.238496 | 1.000000 | 0.005518 |
| amount_of_case | -0.027913 | 0.157999 | 0.955115 | 0.008785 | 0.032384 | 0.063697 | 0.025305 | 0.000484 | 0.210353 | 0.001225 | -0.002531 | 0.004585 | 0.005518 | 1.000000 |
def correlation(dataset, threshold):
col_corr = set()
corr_matrix = dataset.corr()
for i in range(len(corr_matrix.columns)):
for j in range(i):
if abs(corr_matrix.iloc[i, j]) > threshold:
colname = corr_matrix.columns[i]
col_corr.add(colname)
return col_corr
correlation(df_dataset, 0.8)
{'amount_of_case'}
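As a quick sanity check of the `correlation` helper above, here is a self-contained run on a toy frame (the column names `a`, `b`, `c` and their values are invented for illustration): `b` is an exact multiple of `a`, so it is the only column flagged at the 0.8 threshold.

```python
import pandas as pd

def correlation(dataset, threshold):
    # collect every column whose absolute correlation with an
    # earlier column exceeds the threshold
    col_corr = set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                col_corr.add(corr_matrix.columns[i])
    return col_corr

toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # perfectly correlated with 'a'
    'c': [5, 3, 8, 1, 9],    # only weakly correlated with 'a' and 'b'
})
print(correlation(toy, 0.8))  # {'b'}
```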
Comment

Based on the correlation analysis, we conclude that amount_of_case is highly correlated with original_capital (r ≈ 0.96). We therefore decided to drop this column. For the further analysis, we also decided to drop target_amount90_days, because it is the target variable for the regression task.
df_dataset.drop(["amount_of_case", "target_amount90_days"], axis=1, inplace=True)
We apply StandardScaler, which centers each column on its mean and scales it by its standard deviation.
from sklearn.preprocessing import StandardScaler, RobustScaler
dataset1=df_dataset.copy()
std_scaler = StandardScaler()
for col in dataset1.select_dtypes(include=np.number):
if col == "target90_days":
continue
dataset1[col] = std_scaler.fit_transform(dataset1[col].values.reshape(-1,1))
dataset1.head()
| keydate | target90_days | original_capital | industry_code | client_name | failed_closed_cust_cases_36M | cust_payments_all_cases | amount_of_customer_open_cases | no.telephones | customers_at_address/success_closed_cases_36M | age_of_debt | cust_incoming_call_dates | customer_age | cust_payments_12M | last_original_closing_code | success_closed_cases_24M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-12 | 0.0 | -0.253831 | K6622 | 1 | 3.645140e-01 | -7.106768e-01 | -0.163208 | -0.507865 | 0.000000 | -0.035703 | -0.338723 | 0.530841 | -8.081055e-01 | 1 | 2.521422e-16 |
| 1 | 2017-02-03 | 0.0 | -0.310154 | K6512 | 2 | -4.426620e-16 | -3.572967e-01 | -0.286075 | 0.886547 | 0.000000 | 0.020156 | 1.092316 | 0.690331 | -8.081055e-01 | 2 | 2.334117e-01 |
| 2 | 2017-02-17 | 0.0 | -0.392267 | K6512 | 3 | -6.322721e-01 | 9.541315e-17 | -0.286075 | -0.507865 | -1.122447 | 3.265089 | -0.338723 | 0.451095 | -9.587195e-17 | 3 | 2.521422e-16 |
| 3 | 2017-09-18 | 0.0 | -0.100472 | K6622 | 4 | -4.426620e-16 | 9.541315e-17 | -0.258170 | -0.043061 | 2.837585 | -0.492736 | -0.338723 | -1.223556 | -9.587195e-17 | 4 | 2.521422e-16 |
| 4 | 2017-07-22 | 0.0 | -0.330602 | K6512 | 5 | 3.645140e-01 | -6.771063e-01 | -0.243295 | 1.351351 | 0.000000 | 1.111957 | -0.338723 | -0.186867 | -8.081055e-01 | 1 | 2.521422e-16 |
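As a sanity check, StandardScaler's transform is the plain z-score, (x - mean) / std, computed with the population standard deviation. A minimal NumPy reproduction on invented values, confirming that the scaled column has mean 0 and standard deviation 1:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# StandardScaler's transform: subtract the column mean,
# divide by the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

print(z.mean())       # ~0.0
print(z.std(ddof=0))  # ~1.0
```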
dataset1.hist(bins=10, figsize=(15,15))
array([[<AxesSubplot:title={'center':'keydate'}>,
<AxesSubplot:title={'center':'target90_days'}>,
<AxesSubplot:title={'center':'original_capital'}>,
<AxesSubplot:title={'center':'failed_closed_cust_cases_36M'}>],
[<AxesSubplot:title={'center':'cust_payments_all_cases'}>,
<AxesSubplot:title={'center':'amount_of_customer_open_cases'}>,
<AxesSubplot:title={'center':'no.telephones'}>,
<AxesSubplot:title={'center':'customers_at_address/success_closed_cases_36M'}>],
[<AxesSubplot:title={'center':'age_of_debt'}>,
<AxesSubplot:title={'center':'cust_incoming_call_dates'}>,
<AxesSubplot:title={'center':'customer_age'}>,
<AxesSubplot:title={'center':'cust_payments_12M'}>],
[<AxesSubplot:title={'center':'success_closed_cases_24M'}>,
<AxesSubplot:>, <AxesSubplot:>, <AxesSubplot:>]], dtype=object)
In this part, I concentrate on adding new features, transforming existing ones, and extracting extra information.

I start by experimenting with keydate: extracting the day, month, and year.

Then I look at the categorical variables. I decided to drop client_name, because it is nearly unique per client and not useful for the model. Moreover, I convert the remaining categorical variables to numerical ones.

Part of the feature-engineering work was already done earlier, when the non-essential columns were dropped.
# extract year, month, and day from the datetime object
dataset1['year'] = dataset1['keydate'].dt.year
dataset1['month'] = dataset1['keydate'].dt.month
dataset1['day'] = dataset1['keydate'].dt.day
dataset1.drop("keydate", axis=1, inplace=True)
#Check whether all columns are that important or we should drop them
print(dataset1.client_name.unique(), "\n\n")
print(dataset1.industry_code.unique())
['1' '2' '3' '4' '5' '6' '7' '8' '9' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29' '30' '31' '32' '33' '34' '35' '36' '37' '38' '39' '40' '41' '42' '43' '44' '45' '46' '47' '48' '49' '50' '51' '52' '53' '54' '55' '56' '57' '58' '59' '60' '61' '62' '63' '64' '65' '66' '67' '68' '69' '70' '71' '72' '73' '74' '75' '76' '77' '78' '79' '80' '81' '82' '83' '84' '85' '86' '87' '88' '89' '90' '91' '92' '93' '94' '95' '96' '97' '98' '99' '100' '101' '102' '103' '104' '105' '106' '107' '108' '109' '110' '111' '112' '113' '114' '115' '116' '117' '118' '119' '120' '121' '122' '123' '124' '125' '126' '127' '128' '129' '130' '131' '132' '133' '134' '135' '136' '137' '138' '139' '140' '141' '142' '143' '144' '145' '146' '147' '148' '149' '150' '151' '152' '153' '154' '155' '156' '157' '158' '159' '160' '161' '162' '163' '164' '165' '166' '167' '168' '169' '170' '171' '172' '173' '174' '175' '176' '177' '178' '179' '180' '181'] ['K6622' 'K6512' 'K6499' 'K6619' 'K6511' 'K6419' 'K6420' 'K6491']
#Drop Client Name
dataset1.drop("client_name", axis=1, inplace=True)
#Convert Industry Code
dataset2=dataset1.reset_index(drop=True).copy()
for i, code in enumerate(dataset2['industry_code']):
new_code = int(code.strip()[1:]) # remove the first character (K) and convert to int
dataset2.at[i, 'industry_code'] = new_code # assign the new numeric code to the DataFrame
dataset2.industry_code.value_counts()
6512    17305
6622     9915
6619     1391
6419      909
6499      410
6491       24
6420       15
6511        6
Name: industry_code, dtype: int64
dataset2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29975 entries, 0 to 29974
Data columns (total 17 columns):
 #   Column                                         Non-Null Count  Dtype
---  ------                                         --------------  -----
 0   target90_days                                  29975 non-null  float64
 1   original_capital                               29975 non-null  float64
 2   industry_code                                  29975 non-null  object
 3   failed_closed_cust_cases_36M                   29975 non-null  float64
 4   cust_payments_all_cases                        29975 non-null  float64
 5   amount_of_customer_open_cases                  29975 non-null  float64
 6   no.telephones                                  29975 non-null  float64
 7   customers_at_address/success_closed_cases_36M  29975 non-null  float64
 8   age_of_debt                                    29975 non-null  float64
 9   cust_incoming_call_dates                       29975 non-null  float64
 10  customer_age                                   29975 non-null  float64
 11  cust_payments_12M                              29975 non-null  float64
 12  last_original_closing_code                     29975 non-null  object
 13  success_closed_cases_24M                       29975 non-null  float64
 14  year                                           29975 non-null  int64
 15  month                                          29975 non-null  int64
 16  day                                            29975 non-null  int64
dtypes: float64(12), int64(3), object(2)
memory usage: 3.9+ MB
dataset2['last_original_closing_code'] = pd.to_numeric(dataset2['last_original_closing_code'])
dataset2['industry_code'] = pd.to_numeric(dataset2['industry_code'])
dataset2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29975 entries, 0 to 29974
Data columns (total 17 columns):
 #   Column                                         Non-Null Count  Dtype
---  ------                                         --------------  -----
 0   target90_days                                  29975 non-null  float64
 1   original_capital                               29975 non-null  float64
 2   industry_code                                  29975 non-null  int64
 3   failed_closed_cust_cases_36M                   29975 non-null  float64
 4   cust_payments_all_cases                        29975 non-null  float64
 5   amount_of_customer_open_cases                  29975 non-null  float64
 6   no.telephones                                  29975 non-null  float64
 7   customers_at_address/success_closed_cases_36M  29975 non-null  float64
 8   age_of_debt                                    29975 non-null  float64
 9   cust_incoming_call_dates                       29975 non-null  float64
 10  customer_age                                   29975 non-null  float64
 11  cust_payments_12M                              29975 non-null  float64
 12  last_original_closing_code                     29975 non-null  int64
 13  success_closed_cases_24M                       29975 non-null  float64
 14  year                                           29975 non-null  int64
 15  month                                          29975 non-null  int64
 16  day                                            29975 non-null  int64
dtypes: float64(12), int64(5)
memory usage: 3.9 MB
To address the class imbalance, I decided to try both undersampling and oversampling and compare the results. I considered the following methods.

Techniques for undersampling:

- Random sampling
- Cluster
- Tomek links
- Undersampling with ensemble learning

Techniques for oversampling:

- Random sampling
- SMOTE
- ADASYN (an improved version of SMOTE)
- Augmentation

In both cases I decided to go with random sampling: using the same (random) technique on both sides gives a like-for-like environment for comparing the two approaches.

Undersampling

Advantages:

- Improves model runtime and reduces memory usage by shrinking the training set.

Disadvantages:

- Can discard useful information.
- The retained sample may be biased, which can cause the model to perform poorly on unseen data.

Oversampling

Main advantage:

- No information loss.

Disadvantage:

- Risk of overfitting, since it replicates minority-class observations.
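SMOTE, listed among the oversampling options above but not used below, synthesizes new minority examples by interpolating between a minority observation and one of its minority-class nearest neighbours, rather than duplicating rows. A minimal pure-NumPy sketch of that interpolation step (the two points are invented and assumed to already be nearest neighbours; no neighbour search is shown):

```python
import numpy as np

rng = np.random.default_rng(42)

# two minority-class points, assumed to be nearest neighbours
x = np.array([1.0, 2.0])
neighbour = np.array([2.0, 4.0])

# SMOTE draws a random gap in [0, 1) and places the synthetic
# sample on the line segment between the two points
gap = rng.random()
synthetic = x + gap * (neighbour - x)

print(synthetic)  # lies between x and neighbour
```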
from sklearn.model_selection import train_test_split
data_under = dataset2.copy()
data_over = dataset2.copy()
under_target = data_under.target90_days
under_predictors = data_under.drop("target90_days", axis=1)
over_target = data_over.target90_days
over_predictors = data_over.drop("target90_days", axis=1)
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(under_predictors, under_target, test_size=0.3, random_state=42, stratify=under_target)
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(over_predictors, over_target, test_size=0.3, random_state=42, stratify=over_target)
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(sampling_strategy=0.9, random_state=42)
X_resample_under, y_resample_under = rus.fit_resample(X_train_under, y_train_under)
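For clarity: `sampling_strategy=0.9` requests a minority-to-majority ratio of 0.9 after resampling, not a full 50/50 balance. A pure-NumPy sketch of what random undersampling to that ratio amounts to (the class counts below are invented):

```python
import numpy as np

rng = np.random.default_rng(42)
n_majority, n_minority = 1000, 300

# target majority size so that minority / majority == 0.9
target_majority = int(n_minority / 0.9)  # 333

# randomly keep target_majority of the majority indices, discard the rest
kept = rng.choice(n_majority, size=target_majority, replace=False)

print(len(kept))               # 333
print(n_minority / len(kept))  # ~0.9
```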
colors = ['#ef8a62' if class_ == 0 else '#f7f7f7' if class_ == 1 else '#67a9cf' for class_ in y_resample_under]
plt.title("Before UnderSample")
plt.hist(y_train_under)
(array([17440., 0., 0., 0., 0., 0., 0., 0.,
0., 3542.]),
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
<BarContainer object of 10 artists>)
plt.hist(y_resample_under)
plt.title("After RandomUnderSampler")
Text(0.5, 1.0, 'After RandomUnderSampler')
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(sampling_strategy=0.9, random_state=0)
X_resample_over, y_resample_over = ros.fit_resample(X_train_over, y_train_over)
colors = ['#ef8a62' if v == 0 else '#f7f7f7' if v == 1 else '#67a9cf' for v in y_resample_over]
plt.title("Before OverSample")
plt.hist(y_train_over)
(array([17440., 0., 0., 0., 0., 0., 0., 0.,
0., 3542.]),
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
<BarContainer object of 10 artists>)
plt.hist(y_resample_over)
plt.title("After RandomOverSampler")
Text(0.5, 1.0, 'After RandomOverSampler')
#Outlier Detection
# plot boxplot for each numeric column
for col in X_train_under.columns:
    sns.boxplot(x=X_train_under[col])
plt.title(col)
plt.show()
columns=X_train_under.columns
new_under=pd.concat([X_resample_under, y_resample_under], axis=1)
new_over=pd.concat([X_resample_over, y_resample_over], axis=1)
def remove_outliers(dataset, column):
q1, q3 = np.percentile(dataset[column], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)
dataset = dataset[(dataset[column] >= lower_bound) & (dataset[column] <= upper_bound)]
return dataset
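A quick self-contained check of the IQR rule used by `remove_outliers` (toy values invented): with ten small values and one extreme one, the fences land at q1 - 1.5*IQR and q3 + 1.5*IQR, and only the extreme point is dropped.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'v': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]})

q1, q3 = np.percentile(toy['v'], [25, 75])      # 3.5 and 8.5 here
iqr = q3 - q1                                   # 5.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # -4.0 and 16.0

# keep only rows inside the fences
filtered = toy[(toy['v'] >= lower) & (toy['v'] <= upper)]
print(len(filtered))  # 10 -> only the value 100 is removed
```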
# apply the IQR filter cumulatively: each pass filters the output of the
# previous one (re-filtering the raw frame each iteration would keep only
# the last column's filter)
over_without_outlier = new_over.copy()
under_without_outlier = new_under.copy()
for col in columns:
    over_without_outlier = remove_outliers(over_without_outlier, col)
    under_without_outlier = remove_outliers(under_without_outlier, col)
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import collections
import warnings
warnings.filterwarnings("ignore")
under_target = under_without_outlier.target90_days
under_predictors = under_without_outlier.drop("target90_days", axis=1)
over_target = over_without_outlier.target90_days
over_predictors = over_without_outlier.drop("target90_days", axis=1)
X_train_under, X_test_under, y_train_under, y_test_under = train_test_split(under_predictors, under_target, test_size=0.3, random_state=42, stratify=under_target)
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(over_predictors, over_target, test_size=0.3, random_state=42, stratify=over_target)
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Decision Tree Classifier
dtc = DecisionTreeClassifier()
# Define the hyperparameters to tune
parameters = {'criterion': ['gini', 'entropy'],
'max_depth': [3, 4, 5, 6],
'min_samples_split': [2, 3, 4, 5],
'min_samples_leaf': [1, 2, 3]}
# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(dtc, parameters, scoring='f1', cv=5, n_jobs=-1)
grid_search.fit(X_train_under, y_train_under)
# Print the best hyperparameters found
print("Best parameters: ", grid_search.best_params_)
# Use the best hyperparameters to train the model
best_dtc = grid_search.best_estimator_
best_dtc.fit(X_train_under, y_train_under)
# Predict the classes and probabilities on the test set
y_pred = best_dtc.predict(X_test_under)
y_proba = best_dtc.predict_proba(X_test_under)[:, 1]
# Compute the F1 score and AUC-ROC score
f1 = f1_score(y_test_under, y_pred)
auc_roc = roc_auc_score(y_test_under, y_proba)
# Print the F1 score and AUC-ROC score
print("F1 score: ", f1)
print("AUC-ROC score: ", auc_roc)
# Print the classification report
class_report = classification_report(y_test_under, y_pred)
print("Classification Report:\n", class_report)
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Best parameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 3, 'min_samples_split': 2}
F1 score: 0.559648158328752
AUC-ROC score: 0.6797892788212231
Classification Report:
precision recall f1-score support
0.0 0.63 0.79 0.70 1181
1.0 0.67 0.48 0.56 1063
accuracy 0.64 2244
macro avg 0.65 0.63 0.63 2244
weighted avg 0.65 0.64 0.63 2244
#Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Logistic Regression model
lr = LogisticRegression(random_state=42)
# Define the hyperparameter grid to search over
param_grid = {'C': [0.1, 1, 10],
'penalty': ['l1', 'l2']}
# Use GridSearchCV to find the best hyperparameters
lr_grid = GridSearchCV(lr, param_grid, scoring='roc_auc', cv=5)
lr_grid.fit(X_train_under, y_train_under)
# Train the model with the best hyperparameters
lr_best = LogisticRegression(random_state=42, **lr_grid.best_params_)
lr_best.fit(X_train_under, y_train_under)
# Make predictions on the test set
y_pred = lr_best.predict(X_test_under)
y_pred_proba = lr_best.predict_proba(X_test_under)[:,1] # probability scores for class 1
# Compute the evaluation metrics
f1 = f1_score(y_test_under, y_pred)
auc_roc = roc_auc_score(y_test_under, y_pred_proba)
report = classification_report(y_test_under, y_pred)
# Print the evaluation metrics
print("F1 score:", f1)
print("AUC-ROC score:", auc_roc)
print("Classification report:\n", report)
F1 score: 0.5696324951644101
AUC-ROC score: 0.6397459620536194
Classification report:
precision recall f1-score support
0.0 0.62 0.65 0.63 1181
1.0 0.59 0.55 0.57 1063
accuracy 0.60 2244
macro avg 0.60 0.60 0.60 2244
weighted avg 0.60 0.60 0.60 2244
#Random Forest
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Random Forest Classifier
rfc = RandomForestClassifier()
# Define the hyperparameters to tune
parameters = {'n_estimators': [50, 100, 200, 500],
'criterion': ['gini', 'entropy'],
'max_depth': [3, 4, 5, 6],
'min_samples_split': [2, 3, 4, 5],
'min_samples_leaf': [1, 2, 3]}
# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(rfc, parameters, scoring='f1', cv=5, n_jobs=-1)
grid_search.fit(X_train_under, y_train_under)
# Print the best hyperparameters found
print("Best parameters: ", grid_search.best_params_)
# Use the best hyperparameters to train the model
best_rfc = grid_search.best_estimator_
best_rfc.fit(X_train_under, y_train_under)
# Predict the classes and probabilities on the test set
y_pred = best_rfc.predict(X_test_under)
y_proba = best_rfc.predict_proba(X_test_under)[:, 1]
# Compute the F1 score and AUC-ROC score
f1 = f1_score(y_test_under, y_pred)
auc_roc = roc_auc_score(y_test_under, y_proba)
# Print the F1 score and AUC-ROC score
print("F1 score: ", f1)
print("AUC-ROC score: ", auc_roc)
# Print the classification report
class_report = classification_report(y_test_under, y_pred)
print("Classification Report:\n", class_report)
Best parameters: {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 3, 'min_samples_split': 4, 'n_estimators': 50}
F1 score: 0.6045785639958376
AUC-ROC score: 0.7126480500683844
Classification Report:
precision recall f1-score support
0.0 0.65 0.76 0.70 1181
1.0 0.68 0.55 0.60 1063
accuracy 0.66 2244
macro avg 0.66 0.66 0.65 2244
weighted avg 0.66 0.66 0.66 2244
#Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Gradient Boosting model
gbc = GradientBoostingClassifier(random_state=42)
# Define the hyperparameter grid to search over
param_grid = {'learning_rate': [0.01, 0.1, 1],
'n_estimators': [50, 100, 200],
'max_depth': [3, 4, 5]}
# Use GridSearchCV to find the best hyperparameters
gbc_grid = GridSearchCV(gbc, param_grid, scoring='roc_auc', cv=5)
gbc_grid.fit(X_train_under, y_train_under)
# Train the model with the best hyperparameters
gbc_best = GradientBoostingClassifier(random_state=42, **gbc_grid.best_params_)
gbc_best.fit(X_train_under, y_train_under)
# Make predictions on the test set
y_pred = gbc_best.predict(X_test_under)
y_pred_proba = gbc_best.predict_proba(X_test_under)[:,1] # probability scores for class 1
# Compute the evaluation metrics
f1 = f1_score(y_test_under, y_pred)
auc_roc = roc_auc_score(y_test_under, y_pred_proba)
report = classification_report(y_test_under, y_pred)
# Print the evaluation metrics
print("F1 score:", f1)
print("AUC-ROC score:", auc_roc)
print("Classification report:\n", report)
F1 score: 0.614070351758794
AUC-ROC score: 0.7060633119404686
Classification report:
precision recall f1-score support
0.0 0.66 0.73 0.69 1181
1.0 0.66 0.57 0.61 1063
accuracy 0.66 2244
macro avg 0.66 0.65 0.65 2244
weighted avg 0.66 0.66 0.66 2244
#XGBoost
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV
# Define XGBoost classifier
xgb = XGBClassifier()
# Define parameter grid for hyperparameter tuning
param_grid = {
"learning_rate": [0.01, 0.05, 0.1],
"n_estimators": [100, 500, 1000],
"max_depth": [3, 5, 7]
}
# Define grid search with cross-validation
grid_search = GridSearchCV(xgb, param_grid=param_grid, cv=5, scoring='f1')
# Fit the grid search to the training data
grid_search.fit(X_train_under, y_train_under)
# Get the best parameters and score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Print the best parameters and score
print("Best parameters: ", best_params)
print("Best score: ", best_score)
# Train the XGBoost classifier with the best parameters
xgb_best = XGBClassifier(**best_params)
xgb_best.fit(X_train_under, y_train_under)
# Make predictions on the test set
y_pred = xgb_best.predict(X_test_under)
y_proba = xgb_best.predict_proba(X_test_under)[:,1]
# Calculate evaluation metrics
f1 = f1_score(y_test_under, y_pred)
roc_auc = roc_auc_score(y_test_under, y_proba)
class_report = classification_report(y_test_under, y_pred)
# Print evaluation metrics
print("F1 Score:", f1)
print("AUC-ROC Score:", roc_auc)
print("Classification Report:", class_report)
Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
Best score: 0.6204891027120174
F1 Score: 0.6189283925888834
AUC-ROC Score: 0.7123899656126359
Classification Report:
              precision    recall  f1-score   support
0.0 0.66 0.73 0.69 1181
1.0 0.66 0.58 0.62 1063
accuracy 0.66 2244
macro avg 0.66 0.66 0.66 2244
weighted avg 0.66 0.66 0.66 2244
#LightGBM
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV
# Define LightGBM classifier
lgbm = lgb.LGBMClassifier()
# Define parameter grid for hyperparameter tuning
param_grid = {
"learning_rate": [0.01, 0.05, 0.1],
"n_estimators": [100, 500, 1000],
"max_depth": [3, 5, 7]
}
# Define grid search with cross-validation
grid_search = GridSearchCV(lgbm, param_grid=param_grid, cv=5, scoring='f1')
# Fit the grid search to the training data
grid_search.fit(X_train_under, y_train_under)
# Get the best parameters and score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Print the best parameters and score
print("Best parameters: ", best_params)
print("Best score: ", best_score)
# Train the LightGBM classifier with the best parameters
lgbm_best = lgb.LGBMClassifier(**best_params)
lgbm_best.fit(X_train_under, y_train_under)
# Make predictions on the test set
y_pred = lgbm_best.predict(X_test_under)
y_proba = lgbm_best.predict_proba(X_test_under)[:,1]
# Calculate evaluation metrics
f1 = f1_score(y_test_under, y_pred)
roc_auc = roc_auc_score(y_test_under, y_proba)
class_report = classification_report(y_test_under, y_pred)
# Print evaluation metrics
print("F1 Score:", f1)
print("AUC-ROC Score:", roc_auc)
print("Classification Report:", class_report)
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Best parameters: {'learning_rate': 0.01, 'max_depth': 7, 'n_estimators': 500}
Best score: 0.6157119072333505
F1 Score: 0.6204523107177974
AUC-ROC Score: 0.7048971525478274
Classification Report:
              precision    recall  f1-score   support

         0.0       0.66      0.71      0.69      1181
         1.0       0.65      0.59      0.62      1063

    accuracy                           0.66      2244
   macro avg       0.66      0.65      0.65      2244
weighted avg       0.66      0.66      0.65      2244
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import cross_val_predict
"""
The code below is created by author of this submission to Kaggle. It serves an inspiration.
Source: https://www.kaggle.com/code/janiobachmann/credit-fraud-dealing-with-imbalanced-datasets
"""
log_reg_pred = cross_val_predict(lr_best, X_train_under, y_train_under, cv=5,
                                 method="decision_function")
# roc_curve needs continuous scores, not hard class labels, so take the
# out-of-fold positive-class probabilities for the other classifiers
best_rfc_pred = cross_val_predict(best_rfc, X_train_under, y_train_under, cv=5,
                                  method="predict_proba")[:, 1]
gbc_best_pred = cross_val_predict(gbc_best, X_train_under, y_train_under, cv=5,
                                  method="predict_proba")[:, 1]
xgb_best_pred = cross_val_predict(xgb_best, X_train_under, y_train_under, cv=5,
                                  method="predict_proba")[:, 1]
lgbm_best_pred = cross_val_predict(lgbm_best, X_train_under, y_train_under, cv=5,
                                   method="predict_proba")[:, 1]
tree_pred_pred = cross_val_predict(best_dtc, X_train_under, y_train_under, cv=5,
                                   method="predict_proba")[:, 1]
log_fpr, log_tpr, log_thresold = roc_curve(y_train_under, log_reg_pred)
rfc_fpr, rfc_tpr, rfc_threshold = roc_curve(y_train_under, best_rfc_pred)
gbc_fpr, gbc_tpr, gbc_threshold = roc_curve(y_train_under, gbc_best_pred)
xgb_fpr, xgb_tpr, xgb_threshold = roc_curve(y_train_under, xgb_best_pred)
lgbm_fpr, lgbm_tpr, lgbm_threshold = roc_curve(y_train_under, lgbm_best_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train_under, tree_pred_pred)
def graph_roc_curve_multiple(log_fpr, log_tpr, rfc_fpr, rfc_tpr, gbc_fpr, gbc_tpr, xgb_fpr, xgb_tpr, lgbm_fpr, lgbm_tpr, tree_fpr, tree_tpr):
plt.figure(figsize=(16,8))
plt.title('ROC Curve \n Top 6 Classifiers', fontsize=18)
plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train_under, log_reg_pred)))
plt.plot(rfc_fpr, rfc_tpr, label='Random Forest Classifier Score: {:.4f}'.format(roc_auc_score(y_train_under, best_rfc_pred)))
plt.plot(gbc_fpr, gbc_tpr, label='Gradient Boosting Classifier Score: {:.4f}'.format(roc_auc_score(y_train_under, gbc_best_pred)))
    plt.plot(xgb_fpr, xgb_tpr, label='XGBoost Classifier Score: {:.4f}'.format(roc_auc_score(y_train_under, xgb_best_pred)))
plt.plot(lgbm_fpr, lgbm_tpr, label='Light GBM Classifier Score: {:.4f}'.format(roc_auc_score(y_train_under, lgbm_best_pred)))
plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train_under, tree_pred_pred)))
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([-0.01, 1, 0, 1])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
    plt.annotate('ROC AUC of 50% \n (random-guess baseline)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
                 arrowprops=dict(facecolor='#6E726D', shrink=0.05),
                 )
plt.legend()
graph_roc_curve_multiple(log_fpr, log_tpr, rfc_fpr, rfc_tpr, gbc_fpr, gbc_tpr, xgb_fpr, xgb_tpr, lgbm_fpr, lgbm_tpr, tree_fpr, tree_tpr)
plt.show()
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
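A note on the `cross_val_predict` calls above: `roc_curve` expects continuous scores, not hard 0/1 class labels. The minimal, self-contained sketch below (toy arrays, not the notebook's variables) shows why: hard labels collapse the curve to a handful of points, while probability-like scores trace a full curve.

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # continuous scores
labels = (scores >= 0.5).astype(int)                 # hard class labels

# A full curve from continuous scores vs. a degenerate 3-point "curve" from labels
fpr_s, tpr_s, _ = roc_curve(y_true, scores)
fpr_l, tpr_l, _ = roc_curve(y_true, labels)
print(len(fpr_s), len(fpr_l))  # scores yield more thresholds than labels
```

This is why the tree-based models use `method="predict_proba"` (taking the positive-class column) while the linear model can use `decision_function`.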
#Decision Tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Decision Tree Classifier
dtc = DecisionTreeClassifier()
# Define the hyperparameters to tune
parameters = {'criterion': ['gini', 'entropy'],
'max_depth': [3, 4, 5, 6],
'min_samples_split': [2, 3, 4, 5],
'min_samples_leaf': [1, 2, 3]}
# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(dtc, parameters, scoring='f1', cv=5, n_jobs=-1)
grid_search.fit(X_train_over, y_train_over)
# Print the best hyperparameters found
print("Best parameters: ", grid_search.best_params_)
# Use the best hyperparameters to train the model
best_dtc = grid_search.best_estimator_
best_dtc.fit(X_train_over, y_train_over)
# Predict the classes and probabilities on the test set
y_pred = best_dtc.predict(X_test_over)
y_proba = best_dtc.predict_proba(X_test_over)[:, 1]
# Compute the F1 score and AUC-ROC score
f1 = f1_score(y_test_over, y_pred)
auc_roc = roc_auc_score(y_test_over, y_proba)
# Print the F1 score and AUC-ROC score
print("F1 score: ", f1)
print("AUC-ROC score: ", auc_roc)
# Print the classification report
class_report = classification_report(y_test_over, y_pred)
print("Classification Report:\n", class_report)
/Users/bartekrzycki/opt/anaconda3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.2
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Best parameters: {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 2}
F1 score: 0.5733830845771144
AUC-ROC score: 0.7122546137820546
Classification Report:
precision recall f1-score support
0.0 0.64 0.80 0.71 5232
1.0 0.69 0.49 0.57 4709
accuracy 0.65 9941
macro avg 0.66 0.65 0.64 9941
weighted avg 0.66 0.65 0.65 9941
#Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Logistic Regression model
lr = LogisticRegression(random_state=42, solver='liblinear')
# Define the hyperparameter grid to search over
# (liblinear supports both l1 and l2; the default lbfgs solver cannot fit l1)
param_grid = {'C': [0.1, 1, 10],
              'penalty': ['l1', 'l2']}
# Use GridSearchCV to find the best hyperparameters
lr_grid = GridSearchCV(lr, param_grid, scoring='roc_auc', cv=5)
lr_grid.fit(X_train_over, y_train_over)
# Train the model with the best hyperparameters
lr_best = LogisticRegression(random_state=42, solver='liblinear', **lr_grid.best_params_)
lr_best.fit(X_train_over, y_train_over)
# Make predictions on the test set
y_pred = lr_best.predict(X_test_over)
y_pred_proba = lr_best.predict_proba(X_test_over)[:,1] # probability scores for class 1
# Compute the evaluation metrics
f1 = f1_score(y_test_over, y_pred)
auc_roc = roc_auc_score(y_test_over, y_pred_proba)
report = classification_report(y_test_over, y_pred)
# Print the evaluation metrics
print("F1 score:", f1)
print("AUC-ROC score:", auc_roc)
print("Classification report:\n", report)
F1 score: 0.5772392903504976
AUC-ROC score: 0.6477845062775881
Classification report:
precision recall f1-score support
0.0 0.62 0.64 0.63 5232
1.0 0.59 0.57 0.58 4709
accuracy 0.61 9941
macro avg 0.61 0.60 0.60 9941
weighted avg 0.61 0.61 0.61 9941
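Note that not every scikit-learn solver supports every penalty: the default `lbfgs` can only fit `l2`, so a grid containing `l1` needs a compatible solver such as `liblinear` or `saga`. One way to search across solver/penalty pairs safely is the list-of-dicts grid form, sketched here on synthetic data (all names and data are illustrative, not the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Each dict pairs a solver only with the penalties it supports
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.1, 1, 10]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.1, 1, 10]},
]
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

`GridSearchCV` accepts this list-of-dicts form directly, so invalid solver/penalty combinations are never attempted.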
#Random Forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Random Forest Classifier
rfc = RandomForestClassifier()
# Define the hyperparameters to tune
parameters = {'n_estimators': [50, 100, 200, 500],
'criterion': ['gini', 'entropy'],
'max_depth': [3, 4, 5, 6],
'min_samples_split': [2, 3, 4, 5],
'min_samples_leaf': [1, 2, 3]}
# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(rfc, parameters, scoring='f1', cv=5, n_jobs=-1)
grid_search.fit(X_train_over, y_train_over)
# Print the best hyperparameters found
print("Best parameters: ", grid_search.best_params_)
# Use the best hyperparameters to train the model
best_rfc = grid_search.best_estimator_
best_rfc.fit(X_train_over, y_train_over)
# Predict the classes and probabilities on the test set
y_pred = best_rfc.predict(X_test_over)
y_proba = best_rfc.predict_proba(X_test_over)[:, 1]
# Compute the F1 score and AUC-ROC score
f1 = f1_score(y_test_over, y_pred)
auc_roc = roc_auc_score(y_test_over, y_proba)
# Print the F1 score and AUC-ROC score
print("F1 score: ", f1)
print("AUC-ROC score: ", auc_roc)
# Print the classification report
class_report = classification_report(y_test_over, y_pred)
print("Classification Report:\n", class_report)
Best parameters: {'criterion': 'gini', 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 50}
F1 score: 0.6160957686381415
AUC-ROC score: 0.7407675449704938
Classification Report:
precision recall f1-score support
0.0 0.66 0.78 0.72 5232
1.0 0.70 0.55 0.62 4709
accuracy 0.67 9941
macro avg 0.68 0.67 0.67 9941
weighted avg 0.68 0.67 0.67 9941
#Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, roc_auc_score, classification_report
# Define the Gradient Boosting model
gbc = GradientBoostingClassifier(random_state=42)
# Define the hyperparameter grid to search over
param_grid = {'learning_rate': [0.01, 0.1, 1],
'n_estimators': [50, 100, 200],
'max_depth': [3, 4, 5]}
# Use GridSearchCV to find the best hyperparameters
gbc_grid = GridSearchCV(gbc, param_grid, scoring='roc_auc', cv=5)
gbc_grid.fit(X_train_over, y_train_over)
# Train the model with the best hyperparameters
gbc_best = GradientBoostingClassifier(random_state=42, **gbc_grid.best_params_)
gbc_best.fit(X_train_over, y_train_over)
# Make predictions on the test set
y_pred = gbc_best.predict(X_test_over)
y_pred_proba = gbc_best.predict_proba(X_test_over)[:,1] # probability scores for class 1
# Compute the evaluation metrics
f1 = f1_score(y_test_over, y_pred)
auc_roc = roc_auc_score(y_test_over, y_pred_proba)
report = classification_report(y_test_over, y_pred)
# Print the evaluation metrics
print("F1 score:", f1)
print("AUC-ROC score:", auc_roc)
print("Classification report:\n", report)
F1 score: 0.8839232746687761
AUC-ROC score: 0.9295179362441495
Classification report:
precision recall f1-score support
0.0 0.95 0.82 0.88 5232
1.0 0.83 0.95 0.88 4709
accuracy 0.88 9941
macro avg 0.89 0.89 0.88 9941
weighted avg 0.89 0.88 0.88 9941
#XGBoost
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import f1_score, roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV
# Define XGBoost classifier
xgb = XGBClassifier()
# Define parameter grid for hyperparameter tuning
param_grid = {
"learning_rate": [0.01, 0.05, 0.1],
"n_estimators": [100, 500, 1000],
"max_depth": [3, 5, 7]
}
# Define grid search with cross-validation
grid_search = GridSearchCV(xgb, param_grid=param_grid, cv=5, scoring='f1')
# Fit the grid search to the training data
grid_search.fit(X_train_over, y_train_over)
# Get the best parameters and score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Print the best parameters and score
print("Best parameters: ", best_params)
print("Best score: ", best_score)
# Train the XGBoost classifier with the best parameters
xgb_best = XGBClassifier(**best_params)
xgb_best.fit(X_train_over, y_train_over)
# Make predictions on the test set
y_pred = xgb_best.predict(X_test_over)
y_proba = xgb_best.predict_proba(X_test_over)[:,1]
# Calculate evaluation metrics
f1 = f1_score(y_test_over, y_pred)
roc_auc = roc_auc_score(y_test_over, y_proba)
class_report = classification_report(y_test_over, y_pred)
# Print evaluation metrics
print("F1 Score:", f1)
print("AUC-ROC Score:", roc_auc)
print("Classification Report:", class_report)
Best parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 1000}
Best score: 0.8903722920107275
F1 Score: 0.9078158349425755
AUC-ROC Score: 0.9666615362735032
Classification Report:
              precision    recall  f1-score   support

         0.0       0.95      0.87      0.91      5232
         1.0       0.87      0.95      0.91      4709

    accuracy                           0.91      9941
   macro avg       0.91      0.91      0.91      9941
weighted avg       0.91      0.91      0.91      9941
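A caution on these very high scores: they are measured on the oversampled test split (`X_test_over`, `y_test_over`). If oversampling duplicated minority rows before the train/test split, copies of the same row can land on both sides, inflating test metrics. A common safeguard, sketched below on synthetic data (names and data are illustrative), is to oversample only the training portion and score on the untouched test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Upsample the minority class inside the training split only;
# X_te / y_te stay untouched and are used for evaluation
minority = X_tr[y_tr == 1]
n_extra = (y_tr == 0).sum() - (y_tr == 1).sum()
extra = resample(minority, n_samples=n_extra, random_state=0)  # with replacement
X_bal = np.vstack([X_tr, extra])
y_bal = np.concatenate([y_tr, np.ones(n_extra, dtype=int)])
print((y_bal == 1).sum(), (y_bal == 0).sum())  # balanced training classes
```

Any model is then fit on `X_bal`/`y_bal` and scored on the original `X_te`/`y_te`.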
#LightGBM
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
from sklearn.model_selection import GridSearchCV
# Define LightGBM classifier
lgbm = lgb.LGBMClassifier()
# Define parameter grid for hyperparameter tuning
param_grid = {
"learning_rate": [0.01, 0.05, 0.1],
"n_estimators": [100, 500, 1000],
"max_depth": [3, 5, 7]
}
# Define grid search with cross-validation
grid_search = GridSearchCV(lgbm, param_grid=param_grid, cv=5, scoring='f1')
# Fit the grid search to the training data
grid_search.fit(X_train_over, y_train_over)
# Get the best parameters and score from the grid search
best_params = grid_search.best_params_
best_score = grid_search.best_score_
# Print the best parameters and score
print("Best parameters: ", best_params)
print("Best score: ", best_score)
# Train the LightGBM classifier with the best parameters
lgbm_best = lgb.LGBMClassifier(**best_params)
lgbm_best.fit(X_train_over, y_train_over)
# Make predictions on the test set
y_pred = lgbm_best.predict(X_test_over)
y_proba = lgbm_best.predict_proba(X_test_over)[:,1]
# Calculate evaluation metrics
f1 = f1_score(y_test_over, y_pred)
roc_auc = roc_auc_score(y_test_over, y_proba)
class_report = classification_report(y_test_over, y_pred)
# Print evaluation metrics
print("F1 Score:", f1)
print("AUC-ROC Score:", roc_auc)
print("Classification Report:", class_report)
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Best parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 1000}
Best score: 0.8703728527254787
F1 Score: 0.8851626016260162
AUC-ROC Score: 0.9504281037092742
Classification Report:
              precision    recall  f1-score   support

         0.0       0.93      0.85      0.89      5232
         1.0       0.85      0.92      0.89      4709

    accuracy                           0.89      9941
   macro avg       0.89      0.89      0.89      9941
weighted avg       0.89      0.89      0.89      9941
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, roc_auc_score

log_reg_pred = cross_val_predict(lr_best, X_train_over, y_train_over, cv=5,
method="decision_function")
best_rfc_pred=cross_val_predict(best_rfc, X_train_over, y_train_over, cv=5)
gbc_best_pred = cross_val_predict(gbc_best, X_train_over, y_train_over, cv=5)
xgb_best_pred = cross_val_predict(xgb_best, X_train_over, y_train_over, cv=5)
lgbm_best_pred = cross_val_predict(lgbm_best, X_train_over, y_train_over, cv=5)
tree_pred_pred = cross_val_predict(best_dtc, X_train_over, y_train_over, cv=5)
log_fpr, log_tpr, log_thresold = roc_curve(y_train_over, log_reg_pred)
rfc_fpr, rfc_tpr, rfc_threshold = roc_curve(y_train_over, best_rfc_pred)
gbc_fpr, gbc_tpr, gbc_threshold = roc_curve(y_train_over, gbc_best_pred)
xgb_fpr, xgb_tpr, xgb_threshold = roc_curve(y_train_over, xgb_best_pred)
lgbm_fpr, lgbm_tpr, lgbm_threshold = roc_curve(y_train_over, lgbm_best_pred)
tree_fpr, tree_tpr, tree_threshold = roc_curve(y_train_over, tree_pred_pred)
def graph_roc_curve_multiple(log_fpr, log_tpr, rfc_fpr, rfc_tpr, gbc_fpr, gbc_tpr, xgb_fpr, xgb_tpr, lgbm_fpr, lgbm_tpr, tree_fpr, tree_tpr):
plt.figure(figsize=(16,8))
plt.title('ROC Curve \n Top 6 Classifiers', fontsize=18)
plt.plot(log_fpr, log_tpr, label='Logistic Regression Classifier Score: {:.4f}'.format(roc_auc_score(y_train_over, log_reg_pred)))
plt.plot(rfc_fpr, rfc_tpr, label='Random Forest Classifier Score: {:.4f}'.format(roc_auc_score(y_train_over, best_rfc_pred)))
plt.plot(gbc_fpr, gbc_tpr, label='Gradient Boosting Classifier Score: {:.4f}'.format(roc_auc_score(y_train_over, gbc_best_pred)))
plt.plot(xgb_fpr, xgb_tpr, label='XGBoost Classifier Score: {:.4f}'.format(roc_auc_score(y_train_over, xgb_best_pred)))
plt.plot(lgbm_fpr, lgbm_tpr, label='Light GBM Classifier Score: {:.4f}'.format(roc_auc_score(y_train_over, lgbm_best_pred)))
plt.plot(tree_fpr, tree_tpr, label='Decision Tree Classifier Score: {:.4f}'.format(roc_auc_score(y_train_over, tree_pred_pred)))
plt.plot([0, 1], [0, 1], 'k--')
plt.axis([-0.01, 1, 0, 1])
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.annotate('ROC AUC of 50% \n (random-guess baseline)', xy=(0.5, 0.5), xytext=(0.6, 0.3),
arrowprops=dict(facecolor='#6E726D', shrink=0.05),
)
plt.legend()
graph_roc_curve_multiple(log_fpr, log_tpr, rfc_fpr, rfc_tpr, gbc_fpr, gbc_tpr, xgb_fpr, xgb_tpr, lgbm_fpr, lgbm_tpr, tree_fpr, tree_tpr)
plt.show()
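Note that most of the predictions above come from plain `cross_val_predict(...)`, which returns hard 0/1 labels; only logistic regression uses `method="decision_function"`. An ROC curve built from hard labels collapses to a single operating point, so the tree-based AUCs are understated. A self-contained sketch of the difference, using a synthetic dataset and a random forest in place of the notebook's tuned estimators:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0)

# hard labels: only two distinct values, so the ROC curve has one corner
hard = cross_val_predict(clf, X, y, cv=5)

# continuous scores: a full ranking, giving a proper ROC curve and AUC
soft = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

print(roc_auc_score(y, hard), roc_auc_score(y, soft))
```

Passing `method="predict_proba"` (and taking the positive-class column) to every classifier in the comparison would make the six curves directly comparable.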
Comment — Results Interpretation
Summarizing the runs, the end results are:
Undersampling:
Oversampling:
The conclusions:
Why did XGBoost perform best? Some possible reasons: